1 Introduction

This exercise is designed to determine hypotheses of phylogenetic relationships between major biogeographic regions by means of assessing taxonomic relationships between all taxa known from these mountains. Taxa differ with regards to being described to Genera, species, and subspecies, and the hierarchical effects of these relationships sheds light on how similar populations are between different mountain ranges.

Here, we use kmeans clustering and hclust hierarchical clustering. I determine ideal group size of kmeans using the gap-statistic, and the ideal size for hclust using the more qualitative elbow method, with guidance from the kmeans group size.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.4     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   2.0.1     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(ggplot2)
library(vegan)
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.5-7
library(ape)
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(dendextend)
## Registered S3 method overwritten by 'dendextend':
##   method     from 
##   rev.hclust vegan
## 
## ---------------------
## Welcome to dendextend version 1.15.1
## Type citation('dendextend') for how to cite the package.
## 
## Type browseVignettes(package = 'dendextend') for the package vignette.
## The github page is: https://github.com/talgalili/dendextend/
## 
## Suggestions and bug-reports can be submitted at: https://github.com/talgalili/dendextend/issues
## Or contact: <tal.galili@gmail.com>
## 
##  To suppress this message use:  suppressPackageStartupMessages(library(dendextend))
## ---------------------
## 
## Attaching package: 'dendextend'
## The following objects are masked from 'package:ape':
## 
##     ladderize, rotate
## The following object is masked from 'package:permute':
## 
##     shuffle
## The following object is masked from 'package:stats':
## 
##     cutree
x=as.data.frame(read_csv(paste0(filepath,"TableS1_v2.csv")))
## Rows: 732 Columns: 51
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (8): From Bowie, From Dowsett, Exclude, Genus, Superspecies, Species, G...
## dbl (43): Clements, Bioko, Mt. Cameroon, Cameroon Highlands, Bamenda & Adama...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
summary(x)
##   From Bowie        From Dowsett         Exclude             Clements    
##  Length:732         Length:732         Length:732         Min.   :  329  
##  Class :character   Class :character   Class :character   1st Qu.:21948  
##  Mode  :character   Mode  :character   Mode  :character   Median :23890  
##                                                           Mean   :22811  
##                                                           3rd Qu.:28628  
##                                                           Max.   :31354  
##     Genus           Superspecies         Species             Group          
##  Length:732         Length:732         Length:732         Length:732        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##   Subspecies            Bioko          Mt. Cameroon     Cameroon Highlands
##  Length:732         Min.   :0.00000   Min.   :0.00000   Min.   :0.00000   
##  Class :character   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000   
##  Mode  :character   Median :0.00000   Median :0.00000   Median :0.00000   
##                     Mean   :0.04918   Mean   :0.06831   Mean   :0.07104   
##                     3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000   
##                     Max.   :1.00000   Max.   :1.00000   Max.   :1.00000   
##  Bamenda & Adamawa     Lendu           West Rift         Rwenzori     
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.00000   Median :0.00000   Median :0.0000   Median :0.0000  
##  Mean   :0.08743   Mean   :0.09426   Mean   :0.1803   Mean   :0.1339  
##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.0000   3rd Qu.:0.0000  
##  Max.   :1.00000   Max.   :1.00000   Max.   :1.0000   Max.   :1.0000  
##    East Rift          Kabobo        West Ethiopia    East Ethiopia  
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.0000   Min.   :0.000  
##  1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.000  
##  Median :0.0000   Median :0.00000   Median :0.0000   Median :0.000  
##  Mean   :0.1639   Mean   :0.09016   Mean   :0.1407   Mean   :0.127  
##  3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.0000   3rd Qu.:0.000  
##  Max.   :1.0000   Max.   :1.00000   Max.   :1.0000   Max.   :1.000  
##   S Eth-N Ken         Imatong            Elgon          West Kenya    
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.00000   Median :0.00000   Median :0.0000   Median :0.0000  
##  Mean   :0.04645   Mean   :0.08607   Mean   :0.1298   Mean   :0.1475  
##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.0000   3rd Qu.:0.0000  
##  Max.   :1.00000   Max.   :1.00000   Max.   :1.0000   Max.   :1.0000  
##  Kenya-Aberdare     Ngorongoro          Meru         Kilimanjaro   
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.000  
##  Median :0.0000   Median :0.0000   Median :0.0000   Median :0.000  
##  Mean   :0.1448   Mean   :0.1148   Mean   :0.1148   Mean   :0.112  
##  3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.000  
##      Taita              Pare            Usambara           Nguu        
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.00000  
##  Median :0.00000   Median :0.00000   Median :0.0000   Median :0.00000  
##  Mean   :0.05874   Mean   :0.08333   Mean   :0.1107   Mean   :0.05601  
##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.0000   3rd Qu.:0.00000  
##  Max.   :1.00000   Max.   :1.00000   Max.   :1.0000   Max.   :1.00000  
##      Nguru            Ukaguru           Rubeho           Uluguru      
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000  
##  Median :0.00000   Median :0.0000   Median :0.00000   Median :0.0000  
##  Mean   :0.06284   Mean   :0.0724   Mean   :0.06967   Mean   :0.1052  
##  3rd Qu.:0.00000   3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.0000  
##  Max.   :1.00000   Max.   :1.0000   Max.   :1.00000   Max.   :1.0000  
##     Udzungwa      Southern Highlands     Nyika          Kaningina      
##  Min.   :0.0000   Min.   :0.000      Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.0000   1st Qu.:0.000      1st Qu.:0.0000   1st Qu.:0.00000  
##  Median :0.0000   Median :0.000      Median :0.0000   Median :0.00000  
##  Mean   :0.1202   Mean   :0.112      Mean   :0.1107   Mean   :0.08197  
##  3rd Qu.:0.0000   3rd Qu.:0.000      3rd Qu.:0.0000   3rd Qu.:0.00000  
##  Max.   :1.0000   Max.   :1.000      Max.   :1.0000   Max.   :1.00000  
##   Dedza-Salima         Zomba             Thyolo           Mulanje       
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :0.00000   Median :0.00000   Median :0.00000   Median :0.00000  
##  Mean   :0.06557   Mean   :0.06831   Mean   :0.06831   Mean   :0.06694  
##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :1.00000   Max.   :1.00000   Max.   :1.00000   Max.   :1.00000  
##      Namuli          Gorongosa        Chimanimani      N Drakensberg   
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.0000  
##  Median :0.00000   Median :0.00000   Median :0.00000   Median :0.0000  
##  Mean   :0.05601   Mean   :0.04508   Mean   :0.07377   Mean   :0.0806  
##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.0000  
##  Max.   :1.00000   Max.   :1.00000   Max.   :1.00000   Max.   :1.0000  
##  S Drakensberg    Cape Fold Mountains     Angola      
##  Min.   :0.0000   Min.   :0.00000     Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.00000     1st Qu.:0.0000  
##  Median :0.0000   Median :0.00000     Median :0.0000  
##  Mean   :0.0929   Mean   :0.06694     Mean   :0.0806  
##  3rd Qu.:0.0000   3rd Qu.:0.00000     3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :1.00000     Max.   :1.0000
x[is.na(x)]=0
colSums(x[,-c(1:9)])
##               Bioko        Mt. Cameroon  Cameroon Highlands   Bamenda & Adamawa 
##                  36                  50                  52                  64 
##               Lendu           West Rift            Rwenzori           East Rift 
##                  69                 132                  98                 120 
##              Kabobo       West Ethiopia       East Ethiopia         S Eth-N Ken 
##                  66                 103                  93                  34 
##             Imatong               Elgon          West Kenya      Kenya-Aberdare 
##                  63                  95                 108                 106 
##          Ngorongoro                Meru         Kilimanjaro               Taita 
##                  84                  84                  82                  43 
##                Pare            Usambara                Nguu               Nguru 
##                  61                  81                  41                  46 
##             Ukaguru              Rubeho             Uluguru            Udzungwa 
##                  53                  51                  77                  88 
##  Southern Highlands               Nyika           Kaningina        Dedza-Salima 
##                  82                  81                  60                  48 
##               Zomba              Thyolo             Mulanje              Namuli 
##                  50                  50                  49                  41 
##           Gorongosa         Chimanimani       N Drakensberg       S Drakensberg 
##                  33                  54                  59                  68 
## Cape Fold Mountains              Angola 
##                  49                  59
colnames(x)
##  [1] "From Bowie"          "From Dowsett"        "Exclude"            
##  [4] "Clements"            "Genus"               "Superspecies"       
##  [7] "Species"             "Group"               "Subspecies"         
## [10] "Bioko"               "Mt. Cameroon"        "Cameroon Highlands" 
## [13] "Bamenda & Adamawa"   "Lendu"               "West Rift"          
## [16] "Rwenzori"            "East Rift"           "Kabobo"             
## [19] "West Ethiopia"       "East Ethiopia"       "S Eth-N Ken"        
## [22] "Imatong"             "Elgon"               "West Kenya"         
## [25] "Kenya-Aberdare"      "Ngorongoro"          "Meru"               
## [28] "Kilimanjaro"         "Taita"               "Pare"               
## [31] "Usambara"            "Nguu"                "Nguru"              
## [34] "Ukaguru"             "Rubeho"              "Uluguru"            
## [37] "Udzungwa"            "Southern Highlands"  "Nyika"              
## [40] "Kaningina"           "Dedza-Salima"        "Zomba"              
## [43] "Thyolo"              "Mulanje"             "Namuli"             
## [46] "Gorongosa"           "Chimanimani"         "N Drakensberg"      
## [49] "S Drakensberg"       "Cape Fold Mountains" "Angola"
x2=x

# reformat names to be hierarchical

x2$Superspecies=paste(x2$Genus,x2$Superspecies)
x2$Species=paste(x2$Superspecies,x2$Species)
x2$Group=paste(x2$Species,x2$Group)
x2$Subspecies=paste(x2$Group,x2$Subspecies)

2 Clustering Code & Genus

# test variables for coding
level="Genus"
ncluster=9
hcluster=9
xdata=x2
author="Cooper"

# removed ncluster variable
# now determines best group number

clustertaxa=function(level,ncluster=NULL,hcluster=NULL,xdata,author){
  
  if(is.null(ncluster)==T){ncluster=5}
  if(is.null(hcluster)==T){hcluster=5}
  
  x3=xdata %>%
    filter(Exclude!="Exclude") %>%
    select(-`From Bowie`,-`From Dowsett`,-Exclude,-Clements)
  
  xnames=x3[,which(colnames(x3)==level)]
  
  x4=x3%>%select(-Superspecies,-Genus,
                 -Species,-Group,-Subspecies)
  
  col.x=colnames(x4)
  x4=as.data.frame(unclass(t(x4)))
  
  colnames(x4)=xnames
  
  for(i in 1:ncol(x4)){
    x4[,i]=as.numeric(as.character(x4[,i]))
  }
  
  u.names=unique(xnames)
  
  for(i in 1:length(u.names)){
    target=u.names[i]
    if(sum(colnames(x4)==target)<=1){
      index=which(colnames(x4)==target)
      nu.x=x4[,c(index,index)]
      if(i==1){
        x6=nu.x
      }else{
        x6=cbind(x6,nu.x[,1])
      }
      if(i==2){
        x6=x6[,-2]
      }
    }
    if(sum(colnames(x4)==target)>1){
      xx=x4[,which(colnames(x4)==target)]
      nu.x=rowSums(xx)
      nu.x[nu.x>1]=1
      if(i==1){
        x6=nu.x
      }else{
        x6=cbind(x6,nu.x)
      }
      if(i==2){
        x6=x6[,-2]
      }
    }
  }
  colnames(x6)=u.names
  
  # manual plot for kmeans
  
  #wss=(nrow(x6)-1)*sum(apply(x6,2,var))
  #for(v in 2:40){
  #  wss[v]=sum(kmeans(x6,centers=v)$withinss)
  #}
  
  #plot(1:40,wss,type="b",xlab="Number of Clusters",
  #   ylab="Within groups sum of squares")
  
  set.seed(123)
  
  # ncluster.det=which(wss==min(wss))
  
  # defined from above plot
  
  clust.x=hclust(dist(x6),method="average")
  
  wss=(nrow(x6)-1)*sum(apply(x6,2,var))
  x.cut=cutree(clust.x,2:40)

  for(v in 2:30){ # reducing number to be plotted
    x.ssq=aggregate(x6,by=list(x.cut[,v]),function(x){sum(scale(x,scale=F)^2)})
    ssq=rowSums(x.ssq[,-1])
    TSS=sum(x.ssq[,-1])
    wss[v]=TSS
  }
  
  # use hcluster here as a comparison to the kmeans clusters
  
  plot(x=1:30,y=wss,pch=19,type='b',xlab="Number of Clusters",
     ylab="Within groups sum of squares",main="H-Clust")
  abline(v=hcluster)
  
  plot(clust.x,main="Elbow Clust")
  rect.hclust(clust.x,k=hcluster)
  
  plot(clust.x,main="#K Clust")
  rect.hclust(clust.x,k=ncluster)
  
  x.tree=as.phylo(clust.x)
  
  write.tree(x.tree,file=paste0(filepath,"hclust_",level,"_",author,".tre"))
  
  #for(k in 1:nclusters){
  #  xclust=kmeans(x6,nclusters[k],nstart=20)
  #  return(xclust)
  #}
  
  print(fviz_nbclust(x6,kmeans,nstart=2,method="gap_stat",
               nboot=100,k.max=30)+
    labs(subtitle = "Kmeans: Gap Statistic"))
  
  xclust=kmeans(x=x6,centers=ncluster)
  assignments=as.data.frame(xclust$cluster)
  write.csv(assignments,
            paste0(filepath,"clusters_",level,
                   "_K",ncluster,"_",author,".csv"),
            row.names = T,quote = F)
}
# prepare variables for tests
level="Superspecies"
#nclusters=5

# ensure we are excluding problem

xdata=x2 %>% filter(Exclude==0)
bowie.data=xdata%>%filter(`From Bowie`!=0)
dowsett.data=xdata%>%filter(`From Dowsett`!=0)
# my data

clustertaxa(level="Genus",xdata=xdata,
            ncluster=10,hcluster=6,
            author="Cooper")

# Bowie data

clustertaxa(level="Genus",xdata=bowie.data,
            ncluster=4,hcluster=7,
            author="Bowie")

# Dowsett data

clustertaxa(level="Genus",xdata=dowsett.data,
            ncluster=4,hcluster=8,
            author="Dowsett")

3 Superspecies Clusters

# this study

clustertaxa(level="Superspecies",xdata=xdata,
            ncluster=10,hcluster=7,
            author="Cooper")

# Bowie data

clustertaxa(level="Superspecies",xdata=bowie.data,
            ncluster=20,hcluster=5,
            author="Bowie")

clustertaxa(level="Superspecies",xdata=dowsett.data,
            ncluster=13,hcluster=4,
            author="Dowsett")

4 Species Cluster

clustertaxa(level="Species",xdata=xdata,
            ncluster=10,hcluster=4,
            author="Cooper")

clustertaxa(level="Species",xdata=bowie.data,
            ncluster=10,hcluster=4,
            author="Bowie")

clustertaxa(level="Species",xdata=dowsett.data,
            ncluster=5,hcluster=5,
            author="Dowsett")

5 Group

clustertaxa(level="Group",xdata=xdata,
            ncluster=16,hcluster=4,
            author="Cooper")

clustertaxa(level="Group",xdata=bowie.data,
            ncluster=3,hcluster=6,
            author="Bowie")

clustertaxa(level="Group",xdata=dowsett.data,
            ncluster=10,hcluster=5,
            author="Dowsett")

6 Subspecies

clustertaxa(level="Subspecies",xdata=xdata,
            ncluster=8,hcluster=6,
            author="Cooper")

clustertaxa(level="Subspecies",xdata=bowie.data,
            ncluster=11,hcluster=8,
            author="Bowie")

clustertaxa(level="Subspecies",xdata=dowsett.data,
            ncluster=12,hcluster=5,
            author="Dowsett")

7 Number of groups across assessments

The following is the “ideal” number of groups (partitions) within each dendrogram for each taxonomic treatise.

KMeans

Source Genus Superspecies Species Group Subspecies \(\mu\) \(\sigma\)
Dowsett (1986) 4 13 5 10 12 8.8 4.1
Bowie (2003) 4 20 10 3 11 9.6 6.8
This study 10 10 10 16 8 10.8 3.0
Overall \(\mu\) 6 14.3 8.3 9.6 10.3 9.73
Overall \(\sigma\) 3.5 5.1 2.9 6.5 2.1
Dowsett=cbind(1:length(Dowsett),"Dowsett",Dowsett)
Bowie=cbind(1:length(Bowie),"Bowie",Bowie)
This.study=cbind(1:length(This.study),"This study",This.study)

colnames(Dowsett)=colnames(Bowie)=colnames(This.study)=c("X","Study","N.Clusters")

dat.x=data.frame(rbind(Dowsett,Bowie,This.study))
a=ggplot(dat.x,aes(x=X,y=N.Clusters,group=Study))
b=geom_line(aes(linetype=Study))
b2=geom_point()
c=theme_classic()
d=ylim(0,22)
e=xlab('Level')
e2=ylab("# of Clusters")
e3=scale_x_discrete(breaks=c(1,2,3,4,5),labels=c("Genus","Superspecies","Species",
                             "Group","Subspecies"))
  
plot1=a+b+c+d+b2+e+e2+e3
print(plot1)

ggsave(plot1,filename = paste0(filepath,"number_groups.jpg"),dpi = 400)
## Saving 7 x 5 in image

Hierarchical Clustering

Note That this is a subjective measure.

Source Genus Superspecies Species Group Subspecies \(\mu\) \(\sigma\)
Dowsett (1986) 8 4 5 5 5 5.4 1.5
Bowie (2003) 7 5 4 6 8 6 1.6
This study 9 7 4 4 6 6 2.1
Overall \(\mu\) 8.0 5.3 4,3 5.0 6.3 5.8
Overall \(\sigma\) 1.0 1.5 0.6 1.0 1.5
a=ggplot(dat.x,aes(x=X,y=N.Clusters,group=Study))
b=geom_line(aes(linetype=Study))
b2=geom_point()
c=theme_classic()
d=ylim(0,22)
e=xlab('Level')
e2=ylab("# of Clusters")
e3=scale_x_discrete(breaks=c(1,2,3,4,5),labels=c("Genus","Superspecies","Species",
                             "Group","Subspecies"))
  
plot1=a+b+c+d+b2+e+e2+e3
print(plot1)

ggsave(plot1,filename = paste0(filepath,"number_groups.jpg"),dpi = 400)
## Saving 7 x 5 in image

Difference between Elbow and K-Means

Reported as \(kmeans - hclust\), as the gap-statistic is a less fallible measure. Thus, positive numbers have higher numbers of clusters for kmeans and negative values have higher numbers of clusters for the elbow method. The difference between these values can fluctuate greatly between iterations of the pipeline, especially with regards to the ideal number of K-means clusters.

Source Genus Superspecies Species Group Subspecies
Dowsett (1986) -4 9 0 5 7
Bowie (2003) -3 15 6 -3 3
This study 1 3 6 12 2